Conversation
Comprehensive test framework comparing pre-DDP vs post-DDP AWQ quality across 5 models, 3 quantization formats (W4A16_ASYM, W8A8, W8A16), and 3 benchmarks (gsm8k, wikitext, mmlu) via vLLM lm_eval. Models: Llama-3-8B, Qwen3-VL-8B, Llama-4-Scout-17B, Qwen2.5-32B, Mixtral-8x7B Features: - Skips quantization/eval if results already exist (resumable) - Cleans up quantized models after eval to save disk - vLLM eval fallback chain (TP -> expert_parallel -> TP=1 -> eager -> hf) - Applies chat template and fewshot_as_multiturn for eval - Progressive results summary table with extracted metrics - Custom AWQ mappings for Mixtral (w1/w2/w3) and Llama-4-Scout (scoped to language_model) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive regression test suite for AWQ quantization, specifically designed to compare pre-DDP and post-DDP code states. It includes individual test scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5, and Qwen 3 VL, alongside a log extraction utility and an automation script. Feedback focuses on improving the portability and efficiency of the automation script, resolving a documentation contradiction regarding model cleanup, and ensuring the suite actually exercises DDP code paths by correctly configuring multi-GPU quantization.
| # ── Helper: activate environment ───────────────────────────────────────────── | ||
|
|
||
| activate_env() { | ||
| source /home/HDCharles/rhdev/bin/activate |
| MODEL_VLLM_ARGS=( | ||
| "2048,1,1" | ||
| "4096,1,1" | ||
| "4096,2,1" | ||
| "4096,2,1" | ||
| "2048,2,1" | ||
| ) |
There was a problem hiding this comment.
The num_gpus_quant parameter (the third value in each entry of MODEL_VLLM_ARGS) is set to 1 for all models. This means the script never executes the torchrun path (line 279) and only performs single-GPU quantization. This prevents the "DDP regression test suite" from actually exercising Distributed Data Parallel (DDP) code paths in llmcompressor, which is the primary objective of this PR.
| # ./run_all_tests.sh 2>&1 | tee regression_results.log | ||
| # python extract_log_summary.py regression_results.log | ||
| # | ||
| # Models are saved to disk and NOT cleaned up for follow-up evaluation. |
| for scheme in "${SCHEMES[@]}"; do | ||
| for code_state in "${CODE_STATES[@]}"; do |
There was a problem hiding this comment.
The current loop structure is inefficient. It iterates through models and schemes, and for each combination, it switches the git branch and re-installs the package (pip install -e .). This results in 30 separate installations (5 models * 3 schemes * 2 states). Reordering the loops to checkout each code state once and run all applicable models/schemes for that state would significantly reduce execution time.
Summary by CodeRabbitNew Features
Tests
WalkthroughIntroduces a comprehensive AWQ quantization regression test suite with model-specific quantization scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5-32B, and Qwen 3 Vision-Language models, along with automated testing orchestration and result summarization utilities. Changes
Sequence DiagramsequenceDiagram
participant User
participant Script as AWQ Script
participant Model
participant Dataset
participant AWQModifier
participant Quantizer
participant Generator
User->>Script: Execute with scheme & num_samples
Script->>Model: Load pretrained model & tokenizer
Script->>Dataset: Load calibration dataset
Script->>Dataset: Preprocess (chat template & tokenize)
Script->>AWQModifier: Construct recipe with scheme & ignore rules
Script->>Quantizer: Execute oneshot(dataset, recipe, config)
Quantizer->>Model: Apply quantization
Quantizer-->>Script: Return quantized model
Script->>Model: Dispatch for execution
Script->>Generator: Generate sample text
Generator-->>Script: Return generated tokens
Script-->>User: Save quantized model & report metrics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed. |
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/awq/regression_tests/llama3_awq.py`:
- Around line 29-30: The model-loading call using
AutoModelForCausalLM.from_pretrained currently passes dtype="auto" which is
ignored; change the parameter name to torch_dtype="auto" so the model is loaded
with the correct precision (e.g., update the call in llama3_awq.py where
AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") is used and make
the identical change in qwen25_32b_awq.py for its
AutoModelForCausalLM.from_pretrained call; keep the
AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls unchanged.
In `@examples/awq/regression_tests/llama4_scout_awq.py`:
- Around line 19-128: Run ruff format on the file to fix the style errors
reported by CI: run `ruff format` (or your editor's ruff formatting) and stage
the resulting changes; focus on formatting inside the main() block and helper
functions (main, preprocess_function, data_collator, and the
AWQModifier/AWQMapping recipe) so the file passes `ruff format --check` and the
CI quality job turns green. Ensure you do not alter logic—only whitespace, line
breaks, and import/style formatting—then commit the formatted file.
In `@examples/awq/regression_tests/qwen25_32b_awq.py`:
- Around line 29-30: The model is being loaded with the wrong keyword argument:
replace the incorrect dtype="auto" used in AutoModelForCausalLM.from_pretrained
with the correct torch_dtype="auto" so the 32B model loads in the intended
precision; update the call to AutoModelForCausalLM.from_pretrained(MODEL_ID,
torch_dtype="auto") (keeping AutoTokenizer.from_pretrained(MODEL_ID,
trust_remote_code=True) unchanged).
In `@examples/awq/regression_tests/qwen3_vl_awq.py`:
- Around line 103-127: The smoke test currently pulls a remote image URL inside
the messages list (the "image": "http://images.cocodataset.org/...") which
introduces a network dependency that can fail before quantization is saved;
update the messages construction so it uses a local/test image already present
in the harness (e.g., reuse an existing dataset image buffer or path used
earlier) or make the generation step best-effort by wrapping the
vision-processing/generation sequence (process_vision_info, processor(...),
model.generate) in a try/except that logs the failure and continues without
marking the quantization as failed; target the messages variable,
process_vision_info invocation, the processor(...) call and the
model.generate(...) call when applying the change.
In `@extract_log_summary.py`:
- Around line 41-50: The current code searches the whole content for the
MODEL/CODE STATE pattern so repeats reuse the first global match; instead
restrict the search to the current scheme's block: first locate the scheme
header for this iteration (e.g., find the scheme-start boundary in content),
then run re.search(pattern, scheme_block, re.DOTALL) so match, start,
next_section and body are computed relative to that scheme_block (not the full
content); update the logic around pattern, match, start, next_section and body
to slice content to the scheme range before extracting the MODEL and CODE STATE
section.
In `@run_all_tests.sh`:
- Around line 267-269: The cleanup currently removes $save_dir unconditionally
which defeats the earlier quantization skip; update the flow in run_all_tests.sh
so that before doing any quantization/eval work you check for the presence of
the final evaluation outputs (results_*.json) and, if they already exist, skip
the entire quantize+eval block (the code guarded by the $save_dir existence
check and the later removal code around lines referencing results_*.json and the
rm -rf $save_dir cleanup); alternatively, defer deleting $save_dir until after
you confirm all eval outputs have been produced—ensure this change touches the
quantize check that uses $save_dir and the later cleanup block that deletes
$save_dir so resumable runs do not force re-quantization.
- Around line 58-60: The activate_env() function currently hardcodes a personal
path (/home/HDCharles/rhdev/bin/activate); change it to not assume that path by
either: 1) removing the function and documenting that the caller must activate
the virtualenv, or 2) making activate_env() read a venv path from an environment
variable (e.g. VENV_ACTIVATE or VENV_PATH) or a CLI argument, validate that the
file exists before sourcing, and emit a clear error and exit if not set/invalid;
update any callers to pass the variable or rely on the caller-activated
environment.
- Around line 204-215: The current switch_code_state() function checks out
historical commits in-place which removes the test harness and newly added
regression scripts (examples/awq/regression_tests/*); change the approach to
avoid in-place checkouts by either (A) using git worktree or a fresh clone:
create a separate worktree/clone for PRE_DDP_COMMIT and POST_DDP_COMMIT, run pip
install -e . in those isolated directories and invoke the regression scripts
from those worktrees, or (B) before any checkout, copy the harness and
examples/awq/regression_tests/* to a temporary directory and run the tests from
that safe copy; update switch_code_state() and the test invocation logic to
reference the worktree/temporary copy instead of the current working tree so the
scripts introduced by the PR are never removed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 85ec6662-14e8-47f9-9c74-4a521f89aaa5
📒 Files selected for processing (7)
examples/awq/regression_tests/llama3_awq.pyexamples/awq/regression_tests/llama4_scout_awq.pyexamples/awq/regression_tests/mixtral_awq.pyexamples/awq/regression_tests/qwen25_32b_awq.pyexamples/awq/regression_tests/qwen3_vl_awq.pyextract_log_summary.pyrun_all_tests.sh
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") | ||
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the actual code at the specified lines
echo "=== Checking llama3_awq.py lines 25-35 ==="
cat -n examples/awq/regression_tests/llama3_awq.py | sed -n '25,35p'
echo ""
echo "=== Checking qwen25_32b_awq.py ==="
if [ -f "examples/awq/regression_tests/qwen25_32b_awq.py" ]; then
grep -n "AutoModelForCausalLM.from_pretrained" examples/awq/regression_tests/qwen25_32b_awq.py
else
echo "File not found"
fi
echo ""
echo "=== Checking llama4_scout_awq.py ==="
if [ -f "examples/awq/regression_tests/llama4_scout_awq.py" ]; then
grep -n "AutoModelForCausalLM.from_pretrained" examples/awq/regression_tests/llama4_scout_awq.py
else
echo "File not found"
fiRepository: vllm-project/llm-compressor
Length of output: 697
🌐 Web query:
AutoModelForCausalLM.from_pretrained dtype parameter torch_dtype hugging face transformers
💡 Result:
The torch_dtype parameter in AutoModelForCausalLM.from_pretrained specifies the PyTorch data type (dtype) for loading the model weights, overriding the default torch.float32. It accepts a torch.dtype (e.g., torch.float16, torch.bfloat16) or the string "auto". When "auto" is used: - Checks config.json for a torch_dtype or dtype entry. - If absent, infers from the dtype of the first floating-point weight in the checkpoint. Examples: import torch from transformers import AutoModelForCausalLM # Load with specific dtype model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16) # Auto-detect dtype model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype="auto") Note: torch_dtype primarily affects parameters; some buffers with explicit dtypes (e.g., torch.float32) remain unchanged. In recent versions (post-v4.XX), it is also referred to as dtype in examples, but torch_dtype remains valid. This parameter helps reduce memory usage (e.g., FP16/BF16) and is passed via kwargs to the underlying PreTrainedModel.from_pretrained method.
Citations:
- 1: https://hf.co/docs/transformers/models
- 2: https://huggingface.co/docs/transformers/v4.53.2/models
- 3: https://huggingface.co/transformers/v4.10.1/main_classes/model.html
- 4: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/model
- 5: https://huggingface.co/transformers/v4.11.3/main_classes/model.html
- 6: from_pretrained torch_dtype DO NOT affect model buffers huggingface/transformers#30709
- 7: torch_dtype='auto' is not working when using AutoModel.from_pretrained(...) huggingface/transformers#24774
Change dtype parameter to torch_dtype in model-loading calls.
Line 29 uses dtype="auto" but the Transformers API requires torch_dtype="auto". Using the incorrect parameter name causes it to be ignored, loading the model at default precision (float32) instead of auto-detecting the optimal dtype. This undermines the regression test by producing invalid memory and timing measurements.
The same issue exists in examples/awq/regression_tests/qwen25_32b_awq.py on line 29.
Suggested changes
- model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")Apply the same fix to qwen25_32b_awq.py.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) | |
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/awq/regression_tests/llama3_awq.py` around lines 29 - 30, The
model-loading call using AutoModelForCausalLM.from_pretrained currently passes
dtype="auto" which is ignored; change the parameter name to torch_dtype="auto"
so the model is loaded with the correct precision (e.g., update the call in
llama3_awq.py where AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
is used and make the identical change in qwen25_32b_awq.py for its
AutoModelForCausalLM.from_pretrained call; keep the
AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls unchanged.
| def main(): | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument("--scheme", default="W4A16_ASYM") | ||
| parser.add_argument("--save-dir", default=None) | ||
| parser.add_argument("--num-samples", type=int, default=NUM_CALIBRATION_SAMPLES) | ||
| args = parser.parse_args() | ||
|
|
||
| num_samples = args.num_samples | ||
|
|
||
| model = Llama4ForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto") | ||
| processor = Llama4Processor.from_pretrained(MODEL_ID) | ||
|
|
||
| ds = load_dataset( | ||
| DATASET_ID, name="LLM", split=f"train[:{num_samples}]" | ||
| ) | ||
|
|
||
| def preprocess_function(example): | ||
| messages = [] | ||
| for message in example["messages"]: | ||
| messages.append( | ||
| { | ||
| "role": message["role"], | ||
| "content": [{"type": "text", "text": message["content"]}], | ||
| } | ||
| ) | ||
|
|
||
| return processor.apply_chat_template( | ||
| messages, | ||
| return_tensors="pt", | ||
| padding=False, | ||
| truncation=True, | ||
| max_length=MAX_SEQUENCE_LENGTH, | ||
| tokenize=True, | ||
| add_special_tokens=False, | ||
| return_dict=True, | ||
| add_generation_prompt=False, | ||
| ) | ||
|
|
||
| ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names) | ||
|
|
||
| def data_collator(batch): | ||
| assert len(batch) == 1 | ||
| return { | ||
| key: ( | ||
| torch.tensor(value) | ||
| if key != "pixel_values" | ||
| else torch.tensor(value, dtype=torch.bfloat16).squeeze(0) | ||
| ) | ||
| for key, value in batch[0].items() | ||
| } | ||
|
|
||
| # Llama-4-Scout has both vision_model and language_model sub-models, | ||
| # so mappings must be scoped to language_model to avoid dual matches. | ||
| # The main experts use a fused gate_up_proj (not Linear), so only | ||
| # shared_expert Linear layers are AWQ targets. | ||
| recipe = AWQModifier( | ||
| targets="Linear", | ||
| scheme=args.scheme, | ||
| ignore=[ | ||
| "re:.*lm_head", | ||
| "re:.*self_attn", | ||
| "re:.*router", | ||
| "re:.*vision_model.*", | ||
| "re:.*multi_modal_projector.*", | ||
| "Llama4TextAttention", | ||
| ], | ||
| mappings=[ | ||
| AWQMapping( | ||
| "re:.*language_model.*post_attention_layernorm$", | ||
| [ | ||
| "re:.*shared_expert.gate_proj$", | ||
| "re:.*shared_expert.up_proj$", | ||
| ], | ||
| ), | ||
| AWQMapping( | ||
| "re:.*shared_expert.up_proj$", | ||
| ["re:.*shared_expert.down_proj$"], | ||
| ), | ||
| ], | ||
| ) | ||
|
|
||
| torch.cuda.reset_peak_memory_stats() | ||
| start_time = time.time() | ||
|
|
||
| oneshot( | ||
| model=model, | ||
| dataset=ds, | ||
| recipe=recipe, | ||
| max_seq_length=MAX_SEQUENCE_LENGTH, | ||
| num_calibration_samples=num_samples, | ||
| data_collator=data_collator, | ||
| sequential_targets=["Llama4TextMLP"], | ||
| ) | ||
|
|
||
| elapsed_time = time.time() - start_time | ||
| peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3) | ||
| print("Quantization Complete") | ||
| print(f"Time: {elapsed_time / 60:.2f} minutes ({elapsed_time:.2f} seconds)") | ||
| print(f"Peak GPU Memory: {peak_memory_gb:.2f} GB") | ||
|
|
||
| save_dir = args.save_dir or ( | ||
| MODEL_ID.rstrip("/").split("/")[-1] + f"-{args.scheme}" | ||
| ) | ||
| model.save_pretrained(save_dir, save_compressed=True) | ||
| processor.save_pretrained(save_dir) | ||
| print(f"Model saved to {save_dir}") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
There was a problem hiding this comment.
Run ruff format on this file before merge.
CI is already red on examples/awq/regression_tests/llama4_scout_awq.py due to ruff format --check, so this file needs a formatting pass to get the quality job green.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/awq/regression_tests/llama4_scout_awq.py` around lines 19 - 128, Run
ruff format on the file to fix the style errors reported by CI: run `ruff
format` (or your editor's ruff formatting) and stage the resulting changes;
focus on formatting inside the main() block and helper functions (main,
preprocess_function, data_collator, and the AWQModifier/AWQMapping recipe) so
the file passes `ruff format --check` and the CI quality job turns green. Ensure
you do not alter logic—only whitespace, line breaks, and import/style
formatting—then commit the formatted file.
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") | ||
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the actual content at lines 29-30 in the file
cat -n examples/awq/regression_tests/qwen25_32b_awq.py | sed -n '25,35p'Repository: vllm-project/llm-compressor
Length of output: 504
🏁 Script executed (no clone):
#!/bin/bash
# Search for Hugging Face Transformers documentation on from_pretrained parameter names
# Check what the correct parameter should be
curl -s "https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained" 2>/dev/null | grep -i "torch_dtype\|dtype" | head -20 || echo "Could not fetch docs directly"
Length of output: 23243
🏁 Script executed:
#!/bin/bash
# Check for other usages of from_pretrained in the codebase to see patterns
rg "from_pretrained.*\(dtype|torch_dtype)" examples/ --type py -A 2 -B 2Repository: vllm-project/llm-compressor
Length of output: 195
🏁 Script executed:
#!/bin/bash
# Also check the llama3 file mentioned in the scratchpad to see if it has the same issue
cat -n examples/awq/regression_tests/llama3_awq.py | sed -n '25,35p'Repository: vllm-project/llm-compressor
Length of output: 504
🏁 Script executed:
#!/bin/bash
# Search for other from_pretrained usages to see what pattern is typically used
rg "from_pretrained\(" examples/ --type py -A 1 | head -50Repository: vllm-project/llm-compressor
Length of output: 3643
🏁 Script executed:
#!/bin/bash
# Check if there are any successful uses of torch_dtype in the repo
rg "torch_dtype" --type pyRepository: vllm-project/llm-compressor
Length of output: 2328
Use the correct torch_dtype parameter for model loading.
Line 29 uses dtype="auto", but Transformers' from_pretrained() API expects torch_dtype="auto". The incorrect parameter name is silently ignored, causing the 32B model to load in the wrong precision and invalidating regression test results.
Suggested change
- model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) | |
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/awq/regression_tests/qwen25_32b_awq.py` around lines 29 - 30, The
model is being loaded with the wrong keyword argument: replace the incorrect
dtype="auto" used in AutoModelForCausalLM.from_pretrained with the correct
torch_dtype="auto" so the 32B model loads in the intended precision; update the
call to AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
(keeping AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
unchanged).
| messages = [ | ||
| { | ||
| "role": "user", | ||
| "content": [ | ||
| { | ||
| "type": "image", | ||
| "image": "http://images.cocodataset.org/train2017/000000231895.jpg", | ||
| }, | ||
| {"type": "text", "text": "Please describe the animal in this image\n"}, | ||
| ], | ||
| } | ||
| ] | ||
| prompt = processor.apply_chat_template(messages, add_generation_prompt=True) | ||
| image_inputs, video_inputs = process_vision_info(messages) | ||
| inputs = processor( | ||
| text=[prompt], | ||
| images=image_inputs, | ||
| videos=video_inputs, | ||
| padding=False, | ||
| max_length=MAX_SEQUENCE_LENGTH, | ||
| truncation=True, | ||
| return_tensors="pt", | ||
| ).to(model.device) | ||
| output = model.generate(**inputs, max_new_tokens=100) | ||
| print(processor.decode(output[0], skip_special_tokens=True)) |
There was a problem hiding this comment.
Remove the network dependency from the smoke test.
Lines 109-124 pull a remote image for the post-quantization sanity check, and this runs before Lines 133-135 save the quantized artifact. In the harness, a failure here marks the whole block as QUANT_FAILED, so a flaky network can discard a successful quantization for reasons unrelated to AWQ. Reuse one of the already-loaded dataset images or make this generation step best-effort instead.
As per coding guidelines, examples/**/*.py: "Ensure examples follow best practices and are runnable."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/awq/regression_tests/qwen3_vl_awq.py` around lines 103 - 127, The
smoke test currently pulls a remote image URL inside the messages list (the
"image": "http://images.cocodataset.org/...") which introduces a network
dependency that can fail before quantization is saved; update the messages
construction so it uses a local/test image already present in the harness (e.g.,
reuse an existing dataset image buffer or path used earlier) or make the
generation step best-effort by wrapping the vision-processing/generation
sequence (process_vision_info, processor(...), model.generate) in a try/except
that logs the failure and continues without marking the quantization as failed;
target the messages variable, process_vision_info invocation, the processor(...)
call and the model.generate(...) call when applying the change.
| # Find the body for this section | ||
| pattern = re.escape(f"MODEL: {model}") + r".*?" + re.escape(f"CODE STATE: {code_state}") | ||
| match = re.search(pattern, content, re.DOTALL) | ||
| if not match: | ||
| continue | ||
| # Get everything after the match until the next section | ||
| start = match.end() | ||
| next_section = re.search(r"║\s+MODEL:", content[start:]) | ||
| end = start + next_section.start() if next_section else len(content) | ||
| body = content[start:end] |
There was a problem hiding this comment.
This re-search picks the wrong section body for repeated schemes.
Lines 42-49 search the full log again using only MODEL and CODE STATE. Each model/code-state pair appears once per scheme, so every later scheme will reuse the first matching block and get parsed with the wrong metrics. That makes the summary table incorrect for repeated model/scheme combinations.
Suggested fix
- sections = section_pattern.findall(content)
+ sections = list(section_pattern.finditer(content))
@@
- for model, scheme, code_state in sections:
- model = model.strip()
- scheme = scheme.strip()
- code_state = code_state.strip()
-
- # Find the body for this section
- pattern = re.escape(f"MODEL: {model}") + r".*?" + re.escape(f"CODE STATE: {code_state}")
- match = re.search(pattern, content, re.DOTALL)
- if not match:
- continue
- # Get everything after the match until the next section
- start = match.end()
- next_section = re.search(r"║\s+MODEL:", content[start:])
- end = start + next_section.start() if next_section else len(content)
- body = content[start:end]
+ for idx, match in enumerate(sections):
+ model, scheme, code_state = (part.strip() for part in match.groups())
+ start = match.end()
+ end = sections[idx + 1].start() if idx + 1 < len(sections) else len(content)
+ body = content[start:end]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@extract_log_summary.py` around lines 41 - 50, The current code searches the
whole content for the MODEL/CODE STATE pattern so repeats reuse the first global
match; instead restrict the search to the current scheme's block: first locate
the scheme header for this iteration (e.g., find the scheme-start boundary in
content), then run re.search(pattern, scheme_block, re.DOTALL) so match, start,
next_section and body are computed relative to that scheme_block (not the full
content); update the logic around pattern, match, start, next_section and body
to slice content to the scheme range before extracting the MODEL and CODE STATE
section.
| activate_env() { | ||
| source /home/HDCharles/rhdev/bin/activate | ||
| } |
There was a problem hiding this comment.
Don't hardcode a personal virtualenv path.
Line 59 assumes /home/HDCharles/rhdev/bin/activate exists. Everywhere except that workstation, source fails before any regression runs start. Let the caller activate the environment first, or take the venv path from an env var / CLI argument.
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 59-59: Not following: /home/HDCharles/rhdev/bin/activate was not specified as input (see shellcheck -x).
(SC1091)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_all_tests.sh` around lines 58 - 60, The activate_env() function currently
hardcodes a personal path (/home/HDCharles/rhdev/bin/activate); change it to not
assume that path by either: 1) removing the function and documenting that the
caller must activate the virtualenv, or 2) making activate_env() read a venv
path from an environment variable (e.g. VENV_ACTIVATE or VENV_PATH) or a CLI
argument, validate that the file exists before sourcing, and emit a clear error
and exit if not set/invalid; update any callers to pass the variable or rely on
the caller-activated environment.
| switch_code_state() { | ||
| local state=$1 | ||
|
|
||
| if [ "$state" == "pre-ddp" ]; then | ||
| echo "Switching to pre-DDP code ($PRE_DDP_COMMIT)..." | ||
| git checkout "$PRE_DDP_COMMIT" 2>&1 | ||
| pip install -e . --quiet 2>&1 | ||
| elif [ "$state" == "post-ddp" ]; then | ||
| echo "Switching to post-DDP code ($POST_DDP_COMMIT)..." | ||
| git checkout "$POST_DDP_COMMIT" 2>&1 | ||
| pip install -e . --quiet 2>&1 | ||
| fi |
There was a problem hiding this comment.
Checking out the target commits here removes the very scripts you're about to run.
This harness runs from the current PR branch, but Lines 207-214 switch the repo to historical commits before Lines 278-283 invoke examples/awq/regression_tests/*.py. Because those regression scripts are introduced by this PR, the checkout will remove or replace them from the working tree, so the subsequent quantization command can fail with “file not found” or run the wrong script contents. Use separate worktrees/clones for the pre/post code states, or copy the harness scripts outside the repo before switching revisions.
Also applies to: 278-283
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_all_tests.sh` around lines 204 - 215, The current switch_code_state()
function checks out historical commits in-place which removes the test harness
and newly added regression scripts (examples/awq/regression_tests/*); change the
approach to avoid in-place checkouts by either (A) using git worktree or a fresh
clone: create a separate worktree/clone for PRE_DDP_COMMIT and POST_DDP_COMMIT,
run pip install -e . in those isolated directories and invoke the regression
scripts from those worktrees, or (B) before any checkout, copy the harness and
examples/awq/regression_tests/* to a temporary directory and run the tests from
that safe copy; update switch_code_state() and the test invocation logic to
reference the worktree/temporary copy instead of the current working tree so the
scripts introduced by the PR are never removed.
| # ── Quantize (skip if model already exists) ──────────── | ||
| if [ -d "$save_dir" ] && [ -f "$save_dir/config.json" ]; then | ||
| echo "Quantized model already exists at $save_dir, skipping quantization." |
There was a problem hiding this comment.
The cleanup step defeats the resumable quantization skip.
Lines 268-269 only skip quantization while $save_dir still exists, but Lines 333-336 delete it unconditionally after every block. On a resumed run, the harness will therefore re-quantize every model/scheme/state even when Lines 309-317 are about to skip all evaluations because results_*.json already exist. If resumability is a goal, skip the whole block up front when all eval outputs are present, or keep the quantized model until that block is fully reusable.
Also applies to: 309-317, 332-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_all_tests.sh` around lines 267 - 269, The cleanup currently removes
$save_dir unconditionally which defeats the earlier quantization skip; update
the flow in run_all_tests.sh so that before doing any quantization/eval work you
check for the presence of the final evaluation outputs (results_*.json) and, if
they already exist, skip the entire quantize+eval block (the code guarded by the
$save_dir existence check and the later removal code around lines referencing
results_*.json and the rm -rf $save_dir cleanup); alternatively, defer deleting
$save_dir until after you confirm all eval outputs have been produced—ensure
this change touches the quantize check that uses $save_dir and the later cleanup
block that deletes $save_dir so resumable runs do not force re-quantization.
|
The quality checks have failed. Please run |
Summary
2ab02443) vs post-DDP (0bc916e5) AWQ quality across 5 models, 3 quantization formats, and 3 evaluation benchmarksrun_all_tests.sh) with resumable execution, progressive results table, and automatic model cleanupTest Matrix
Usage
Test plan
🤖 Generated with Claude Code